THERMODYNAMIC PROPENSITIES OF AMINO ACIDS IN THE NATIVE STATE
ENSEMBLE: IMPLICATIONS FOR FOLD RECOGNITION 

[0001] This Application claims priority to U.S. Provisional Application No.
60/261,733, which was filed on January 16, 2001. 

[0002] The work herein was supported by grants from the United States 
Government. The United States Government may have certain rights in the invention. 

BACKGROUND OF THE INVENTION 

I. Field of the Invention 

[0003] The present invention relates to the field of structural biology. More
particularly, the present invention relates to a protein database and methods of developing a
protein database that contains all of the thermodynamic information necessary to encode a
three-dimensional protein structure.

II. Related Art 

[0004] It is a longstanding idea that protein structures are the result of an 
amino acid chain finding its global free energy minimum in the solvent environment 
(Anfinsen, 1973). Several exceptions to this so-called "thermodynamic control" have been 
discovered in recent years, including examples of proteins whose folding may be under 
"kinetic control" (Baker et al, 1992, Cohen, 1999) and proteins requiring information not 
completely contained in the amino acid sequence (e.g., chaperone-assisted folding (Feldman 
& Frydman 2000, Fink 1999)). Although thermodynamic control is widely accepted as the 
default behavior for correct folding (Jackson, 1998), a detailed understanding of the forces 
involved in thermodynamic control and how atomic interactions relate amino acid sequence 
to the folding and stability of the native structure has still proven elusive. 

[0005] Despite the progress that has been made in understanding protein folding,
an accurate structure prediction algorithm has not yet been achieved. One obstacle has been
the lack of suitable potentials for calculating the free energies of different conformations
of a given protein molecule. In 1992, high-pressure liquid chromatography (HPLC) was used to
quantitate the energies of pairwise interactions between amino acid side chains (Pochapsky
and Gopen, 1992). In 1999, Pochapsky used HPLC to study these thermodynamic interactions
further. A stationary phase was prepared for the HPLC by derivatizing microparticulate
silica gels with functionality mimicking the side chains of hydrophobic and amphiphilic
amino acid analytes (Pereira de Araujo et al, 1999). Thus, this variation of the HPLC method
compares entropies and free energies of interaction using different derivatized
microparticulate silica gels.

[0006] The present invention uses a computer-based algorithm to address for 
the first time whether amino acid residue types have distinct preferences for thermodynamic 
environments in the folded native structure of a protein, and whether a scoring matrix based 
solely on thermodynamic information (independent of explicit structural constraints) can be 
used to identify correct sequences that correspond to a particular target fold. This is done by
means of a unique approach in which the regional stability differences within a protein are
determined for a database of proteins using the COREX algorithm (Hilser & Freire, 1996).
The COREX algorithm generates an ensemble of states using the high-resolution structure as
a template. Based on the relative probability of the different states in the ensemble, different
regions of the protein are found to be more stable than others. Thus, the COREX algorithm
provides access to residue-specific free energies of folding.

BRIEF SUMMARY OF THE INVENTION

[0007] One embodiment of the present invention is directed to a system and
method of developing a protein database that contains all of the thermodynamic information
necessary to encode a three-dimensional protein structure.

[0008] Another embodiment of the present invention comprises a protein
database comprising nonhomologous proteins having known residue-specific free energies of
folding of the proteins. In specific embodiments, the database comprises globular proteins. 

[0009] In further embodiments, the database is determined by a computational 
method comprising the step of determining a stability constant from the ratio of the summed 
probability of all states in the ensemble in which a residue j is in a folded conformation to the 
summed probability of all states in which j is in an unfolded conformation according to the 
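equation,

$$K_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}}$$

where the numerator sums the probabilities of all states in which residue j is folded and
the denominator sums the probabilities of all states in which residue j is unfolded.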


[0010] Another specific embodiment of the present invention comprises that 
the stability constants for the residues are arranged into at least one of the three 
thermodynamic classification groups selected from the group consisting of stability, enthalpy, 
and entropy. 

[0011] In specific embodiments, the stability thermodynamic classification 
group comprises high stability, medium stability and low stability. More particularly, the 
residues in the high stability classification comprise phenylalanine, tryptophan and tyrosine.
The residues in the low stability classification comprise glycine and proline. The
residues in the medium stability classification comprise asparagine and glutamic acid.

[0012] Yet further, the enthalpy thermodynamic classification group 
comprises high enthalpy and low enthalpy. Enthalpy comprises a ratio of the contributions of 
polar and apolar components. 

[0013] In another specific embodiment, the entropy thermodynamic 
classification group comprises high entropy and low entropy. Entropy comprises a ratio of 
the contributions of polar and apolar components. 

[0014] In a further embodiment, the stability constants for the residues are 
arranged into twelve thermodynamic classifications selected from the group consisting of 
HHH, MHH, LHH, HHL, MHL, LHL, HLL, MLL, LLL, HLH, MLH and LLH. 
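
The twelve labels follow from combining three stability levels with two enthalpy-ratio levels and two entropy-ratio levels. The sketch below is illustrative only: the function name `classify_environment` and every cutoff passed to it are hypothetical, since the text does not state numeric thresholds.

```python
from itertools import product

# Stability has three levels; the enthalpy and entropy ratios have two each.
STABILITY_LEVELS = ("H", "M", "L")
RATIO_LEVELS = ("H", "L")

# All 3 x 2 x 2 = 12 labels, ordered stability / enthalpy / entropy.
ENVIRONMENTS = [
    "".join(combo)
    for combo in product(STABILITY_LEVELS, RATIO_LEVELS, RATIO_LEVELS)
]


def classify_environment(ln_kf, h_ratio, s_ratio, stability_cuts, h_cut, s_cut):
    """Return a three-letter label such as 'MHL'.

    All cutoffs are hypothetical inputs; the text gives no numeric values.
    """
    low_cut, high_cut = stability_cuts
    if ln_kf >= high_cut:
        stability = "H"
    elif ln_kf >= low_cut:
        stability = "M"
    else:
        stability = "L"
    enthalpy = "H" if h_ratio >= h_cut else "L"
    entropy = "H" if s_ratio >= s_cut else "L"
    return stability + enthalpy + entropy
```

In practice the cutoffs would be chosen by inspecting the distribution of values in the selected protein database, so the same residue may receive a different label under a different database.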

[0015] Another embodiment of the present invention is a method of developing 
a protein database comprising the steps of: inputting high resolution structures of proteins; 
generating an ensemble of incrementally different conformational states by combinatorial 
unfolding of a set of predefined folding units in all possible combinations of each protein; 
determining the probability of each said conformational state; calculating a residue-specific 
free energy of each said conformational state; and classifying a stability constant into at least 
one thermodynamic classification group selected from the group consisting of stability, 
enthalpy, and entropy. Specifically, the protein database comprises globular and 
nonhomologous proteins.

[0016] In specific embodiments, the generating step comprises dividing the
proteins into folding units by placing a block of windows over the entire sequence of the 
protein and sliding the block of windows one residue at a time. 
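
The partitioning and combinatorial unfolding steps can be sketched as follows. This is a minimal illustration, not the COREX implementation: the function names are hypothetical, residues are indexed from zero, and the treatment of leftover residues at the chain ends (they form shorter units) is an assumption.

```python
from itertools import product


def partition_folding_units(n_residues, window):
    """Return one list of (start, end) folding units per window offset.

    `end` is exclusive. Sliding the block of windows one residue at a time
    gives `window` distinct partitions. Handling of the chain ends is an
    assumption: leftover residues simply form shorter units.
    """
    partitions = []
    for offset in range(window):
        units = []
        start = 0
        end = offset if offset > 0 else window
        while start < n_residues:
            end = min(end, n_residues)
            units.append((start, end))
            start, end = end, end + window
        partitions.append(units)
    return partitions


def enumerate_states(units):
    """Return an iterator over every folded/unfolded combination.

    Each state is a tuple of booleans, one per folding unit (True = folded),
    so a partition with u units yields 2**u states.
    """
    return product((True, False), repeat=len(units))
```

For a 10-residue chain and a window of 5, `partition_folding_units(10, 5)` gives five partitions, the second of which is `[(0, 1), (1, 6), (6, 10)]` and therefore contributes eight states.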

[0017] In a further specific embodiment, the determining step comprises
determining the free energy of each of the conformational states in the ensemble; determining
the Boltzmann weight [$K_i = \exp(-\Delta G_i/RT)$] of each state; and determining the
probability of each state using the equation:

$$P_i = \frac{K_i}{\sum_i K_i}$$
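
As a hedged numerical sketch of this step, the function below converts a list of state free energies into Boltzmann weights and normalizes them into probabilities. The function name, the choice of kcal/mol units, and the default temperature are assumptions made for illustration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K); the unit choice is an assumption


def state_probabilities(delta_g, temperature=298.15):
    """Return the probability of each state from its free energy.

    delta_g: free energies (kcal/mol) of each state relative to a reference.
    Each Boltzmann weight is K_i = exp(-dG_i / RT); each probability is
    K_i divided by the sum of all weights.
    """
    weights = [math.exp(-dg / (R_KCAL * temperature)) for dg in delta_g]
    partition_function = sum(weights)
    return [w / partition_function for w in weights]
```

States with lower free energy receive larger weights, and the probabilities sum to one by construction.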



[0018] In specific embodiments, the calculating step comprises determining the 
energy difference between all microscopic states in which a particular residue is folded and 
all such states in which it is unfolded using the equation 

$$\Delta G_{f,j} = -RT \ln K_{f,j}$$
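
The residue-specific quantities can be illustrated with a short sketch. It assumes that each state is described by its probability and by a per-residue folded/unfolded flag; the function names and the data layout are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K); the unit choice is an assumption


def residue_stability(probabilities, folded, j):
    """Return K_f,j for residue j.

    probabilities: probability of each state in the ensemble.
    folded: one sequence of booleans per state; folded[s][j] is True when
            residue j is folded in state s.
    Raises ZeroDivisionError if residue j is never unfolded in the ensemble.
    """
    p_folded = sum(p for p, mask in zip(probabilities, folded) if mask[j])
    p_unfolded = sum(p for p, mask in zip(probabilities, folded) if not mask[j])
    return p_folded / p_unfolded


def residue_free_energy(k_fj, temperature=298.15):
    """Return dG_f,j = -RT ln K_f,j in kcal/mol."""
    return -R_KCAL * temperature * math.log(k_fj)
```

For a three-state toy ensemble with probabilities 0.9, 0.06 and 0.04, a residue that is folded in the first two states has a stability constant of 0.96/0.04 = 24.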

[0019] Another embodiment of the present invention is a method of identifying 
a protein fold comprising determining the distribution of amino acid residues in different 
thermodynamic environments corresponding to a known protein structure. Specifically, 
determining the distribution of amino acid residues comprises constructing scoring matrices 
derived from thermodynamic information. The scoring matrices are derived from COREX
thermodynamic information selected from the group consisting of stability, enthalpy, and 
entropy. 

[0020] The aforementioned embodiments of the present invention may be 
readily implemented as a computer-based system. One embodiment of such a computer- 
based system includes a computer program that receives an input of high resolution structure 
data for one or more proteins. The computer-based program utilizes this data to determine 
the amino acid thermodynamic classifications for the proteins. These amino acid 
thermodynamic classifications may then be stored in a database. The database of the system 
preferably has a data structure with a field or fields for storing a value for an amino acid 
name or amino acid abbreviation, and one or more classification fields for storing a numerical 
value for a thermodynamic classification for a particular amino acid. Additionally, this data
structure may have a field for storing a value representing the summed total of each of the
numerical values for each thermodynamic classification for a particular amino acid. 
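
One possible shape for such a record is sketched below. The class name, field names, and the example counts are hypothetical; the text specifies only that a name or abbreviation, per-classification numerical values, and a summed total are stored.

```python
from dataclasses import dataclass, field


@dataclass
class AminoAcidRecord:
    """One database row per amino acid type (illustrative layout only)."""

    name: str            # e.g. "phenylalanine"
    abbreviation: str    # e.g. "Phe"
    # Numerical value per thermodynamic classification, keyed by label.
    classification_values: dict = field(default_factory=dict)

    @property
    def total(self):
        """Summed total of the values over all classifications."""
        return sum(self.classification_values.values())


# Example with made-up counts, for illustration only.
record = AminoAcidRecord("phenylalanine", "Phe", {"HLL": 12, "MLL": 5, "LHH": 1})
```

Computing the total on demand, rather than storing it, keeps it consistent with the per-classification values; a stored field would serve equally well if it is updated together with them.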

[0021] In one embodiment of the inventive system, the computer-based 
program performs a process to generate thermodynamic classifications for a protein which 
includes inputting high resolution structures of proteins, generating an ensemble of 
incrementally different conformational states by combinatorial unfolding of a set of 
predefined folding units in all possible combinations of each protein, determining the 
probability of each said conformational state, calculating a residue-specific free energy of 
each said conformational state, and classifying a stability constant into a thermodynamic 
classification group. Additionally, the computer-based program may have a probability 
determination module to determine the free energy of each of the conformational states in a 
computed ensemble, determine a Boltzmann weight, and then determine the probability of 
each state. 

[0022] Moreover, the computer-based program of the inventive system may 
have a display/reporting module for producing one or more graphical reports to a screen or a 
print-out. Some of these reports include: a display of a three-dimensional protein structure 
based on said amino acid thermodynamic classifications; a scatter-plot of normalized 
frequencies of COREX stability data versus normalized frequencies of average side chain 
surface exposure; and a chart displaying thermodynamic environments for amino acids of a 
protein. 

[0023] Another aspect of the inventive methods is that they may be stored as 
computer executable instructions on computer-readable medium. 

[0024] The foregoing has outlined rather broadly the features and technical 
advantages of the present invention in order that the detailed description of the invention that 
follows may be better understood. Additional features and advantages of the invention will 
be described hereinafter which form the subject of the claims of the invention. It should be 
appreciated by those skilled in the art that the conception and specific embodiment disclosed 
may be readily utilized as a basis for modifying or designing other structures for carrying out 
the same purposes of the present invention. It should also be realized by those skilled in the 
art that such equivalent constructions do not depart from the spirit and scope of the invention 
as set forth in the appended claims. The novel features which are believed to be
characteristic of the invention, both as to its organization and method of operation, together
with further objects and advantages will be better understood from the following description 
when considered in connection with the accompanying figures. It is to be expressly 
understood, however, that each of the figures is provided for the purpose of illustration and 
description only and is not intended as a definition of the limits of the present invention. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0025] The following drawings form part of the present specification and are 
included to further demonstrate certain aspects of the present invention. The invention may 
be better understood by reference to one or more of these drawings in combination with the 
detailed description of specific embodiments presented herein. 

[0026] Figure 1A and Figure 1B are a schematic description of the COREX
algorithm applied to the crystal structure of the ovomucoid third domain, OM3 (2ovo).
Figure 1A summarizes the partitioning strategy of the COREX algorithm. Figure 1B
illustrates the solvent exposed surface area (ASA) contributing to the energetics of microstate 
32. 

[0027] Figure 2 is a comparison of hydrogen exchange protection factors 
predicted from COREX data with experimental values for ovomucoid third domain (2ovo). 
Unfilled vertical bars denote predicted values, and filled vertical bars denote experimental 
values (Swint-Kruse & Robertson, 1996). The solid line denotes $\ln K_f$ values. The simulated
temperature of the COREX calculation was set at 30 °C to match the experimental conditions. 
Secondary structure is given by labeled horizontal lines. Asterisks show the positions of Thr 
47 and Thr 49, referred to in the text. 

[0028] Figure 3A, Figure 3B, Figure 3C, Figure 3D, Figure 3E, Figure 3F,
Figure 3G, Figure 3H, Figure 3I, Figure 3J, Figure 3K, Figure 3L, Figure 3M, Figure 3N,
Figure 3O, Figure 3P, Figure 3Q, Figure 3R, Figure 3S and Figure 3T comprise
normalized frequencies of COREX stability data as a function of amino acid type. Figure 3A
shows the data as a function of the amino acid alanine. Figure 3B shows the data as a 
function of the amino acid arginine. Figure 3C shows the data as a function of the amino acid 
asparagine. Figure 3D shows the data as a function of the amino acid aspartic acid. Figure 
3E shows the data as a function of the amino acid cysteine. Figure 3F shows the data as a 
function of the amino acid glutamine. Figure 3G shows the data as a function of the amino
acid glutamic acid. Figure 3H shows the data as a function of the amino acid glycine. Figure
3I shows the data as a function of the amino acid histidine. Figure 3J shows the data as a
function of the amino acid isoleucine. Figure 3K shows the data as a function of the amino 
acid leucine. Figure 3L shows the data as a function of the amino acid lysine. Figure 3M 
shows the data as a function of the amino acid methionine. Figure 3N shows the data as a 
function of the amino acid phenylalanine. Figure 3O shows the data as a function of the
amino acid proline. Figure 3P shows the data as a function of the amino acid serine. Figure 
3Q shows the data as a function of the amino acid threonine. Figure 3R shows the data as a 
function of the amino acid tryptophan. Figure 3S shows the data as a function of the amino 
acid tyrosine. Figure 3T shows the data as a function of the amino acid valine. In each 
histogram, the low stability bin is on the left, the medium stability bin is in the middle, and 
the high stability bin is on the right. The data used in each histogram was taken from the 
2922 residue data set, as given in Table 2. 

[0029] Figure 4 is a scatterplot of normalized frequencies of COREX stability 
data versus normalized frequencies of average side chain surface area exposure. Average 
side chain exposure in the native structure was calculated by using a moving window of five 
residues, similar to the basis of the COREX algorithm. These values were then binned into 
high, medium, and low surface area exposure. 
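
The moving-window average described here can be sketched as below. The window is assumed to be centered on each residue and truncated at the chain ends; the text states only that a five-residue window was used, so both choices are assumptions, as is the function name.

```python
def window_average(values, window=5):
    """Return the moving average of per-residue values.

    The window is centered on each position and truncated at the ends of
    the chain (an assumption; the text does not describe end handling).
    """
    half = window // 2
    averaged = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged
```

The averaged values would then be assigned to high, medium, and low exposure bins using cutoffs chosen from the database, which are not specified in this passage.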

[0030] Figure 5A, Figure 5B, Figure 5C and Figure 5D illustrate a summary of 
fold-recognition results for COREX stability and DSSP secondary structure scoring matrices 
for 44 targets. Black bars denote real data (either $\ln K_f$ or secondary structure), and striped
bars denote the average of three random data sets. Figure 5A shows the $\ln K_f$ scoring matrix
with the local alignment algorithm. Figure 5B shows the $\ln K_f$ scoring matrix with the global
alignment algorithm. Figure 5C shows the secondary structure scoring matrix with the local
alignment algorithm. Figure 5D shows the secondary structure scoring matrix with the global
alignment algorithm.

[0031] Figure 6A, Figure 6B and Figure 6C illustrate examples of successful 
local alignment for three targets. Results for target 1igd (Protein G) are shown in Figure 6A,
results for target 1vcc (DNA topoisomerase I) are shown in Figure 6B, and results for target
2ait (tendamistat) are shown in Figure 6C. The thin black line represents COREX calculated
stability data ($\ln K_f$) for the protein target. The filled circles connected by a thick black line
correspond to the cumulative matrix score contributed by each residue. Scores that did not
contribute to the final score due to the rules of the local alignment algorithm (Smith &
Waterman, 1981) are shown as unfilled circles connected by a thick dashed line. 

[0032] Figure 7 is a correlation between stability data derived from the database 
of 44 proteins used in this work and stability data derived from an independent database of 50 
proteins. Data on the x-axis are taken from the normalized histograms in Figure 3A-Figure 
3T. Data on the y-axis are derived from an identical COREX analysis of an independent 
database of 3304 residues from 50 PDB structures not contained in the original database. 
Open circles denote the values for His, a residue type with low statistics in both databases. 
The dashed line represents a perfect correlation. 

[0033] Figure 8A and Figure 8B illustrate the results of a COREX calculation
for the bacterial cold-shock protein cspA (PDB 1mjc). Figure 8A shows a plot of calculated
thermodynamic stability, $\ln K_{f,j}$, as a function of residue number for cspA. The simulated
temperature was 25.0°C. Regions of relatively high, medium, and low stability are shown in
dark gray, light gray, and black, respectively. Secondary structure elements, as defined by
the program DSSP (Kabsch and Sander, 1983), are labeled. Figure 8B locates the relative
calculated stabilities of each residue in the 1mjc crystal structure. Note that a given
secondary structural element is predicted to have varying regions of stability, and that the
most stable regions of the molecule are often, but not necessarily, within the hydrophobic
core.

[0034] Figure 9A, Figure 9B and Figure 9C illustrate a description of protein 
structure in terms of thermodynamic environments. Figure 9A shows the thermodynamic 
environment classification scheme used herein. Three quantities derived from the output of 
the COREX algorithm, stability ($K_{f,j}$), enthalpy ratio ($H_{ratio,j}$), and entropy ratio ($S_{ratio,j}$),
describe the thermodynamic environment of each residue. Figure 9B shows the 12 
thermodynamic environments defined by this classification scheme in a schematic describing 
protein energetic phase space. Each cube represents a region dominated by certain stability, 
enthalpy, and entropy characteristics. Every residue position in the protein structures used 
herein lies somewhere within this phase space. Figure 9C shows examples of the distribution 
of the thermodynamic environments of Figure 9B in three proteins with varying types and
amounts of secondary structure. Note that single secondary structure elements do not exhibit 
unique thermodynamic environments.

[0035] Figure 10A, Figure 10B, Figure 10C, Figure 10D, Figure 10E, Figure
10F, Figure 10G, Figure 10H, Figure 10I, Figure 10J, Figure 10K and Figure 10L show 3D-1D
scores relating amino acid types to 12 protein structural thermodynamic environments.
The three-letter abbreviation in each panel represents the stability, enthalpic, and entropic 
descriptor of the thermodynamic environment. Stability is classified into high, medium and 
low. Entropy and enthalpy are classified into high and low. Figure 10A represents LHH, 
which is a protein thermodynamic environment of low stability, high polar/apolar enthalpy 
ratio, and high conformational entropy/Gibbs' solvation energy ratio. Figure 10B represents 
LHL, which is a protein thermodynamic environment of low stability, high polar/apolar 
enthalpy ratio, and low conformational entropy/Gibbs' solvation energy ratio. Figure 10C 
represents LLH, which is a protein thermodynamic environment of low stability, low 
polar/apolar enthalpy ratio, and high conformational entropy/Gibbs' solvation energy ratio. 
Figure 10D represents LLL, which is a protein thermodynamic environment of low stability, 
low polar/apolar enthalpy ratio, and low conformational entropy/Gibbs' solvation energy 
ratio. Figure 10E represents MHH, which is a protein thermodynamic environment of 
medium stability, high polar/apolar enthalpy ratio, and high conformational entropy/Gibbs' 
solvation energy ratio. Figure 10F represents MHL, which is a protein thermodynamic 
environment of medium stability, high polar/apolar enthalpy ratio, and low conformational 
entropy/Gibbs' solvation energy ratio. Figure 10G represents MLH, which is a protein 
thermodynamic environment of medium stability, low polar/apolar enthalpy ratio, and high 
conformational entropy/Gibbs' solvation energy ratio. Figure 10H represents MLL, which is 
a protein thermodynamic environment of medium stability, low polar/apolar enthalpy ratio, 
and low conformational entropy/Gibbs' solvation energy ratio. Figure 10I represents HHH,
which is a protein thermodynamic environment of high stability, high polar/apolar enthalpy 
ratio, and high conformational entropy/Gibbs' solvation energy ratio. Figure 10J represents 
HHL, which is a protein thermodynamic environment of high stability, high polar/apolar 
enthalpy ratio, and low conformational entropy/Gibbs' solvation energy ratio. Figure 10K 
represents HLH, which is a protein thermodynamic environment of high stability, low 
polar/apolar enthalpy ratio, and high conformational entropy/Gibbs' solvation energy ratio. 
Figure 10L represents HLL, which is a protein thermodynamic environment of high stability, 
low polar/apolar enthalpy ratio, and low conformational entropy/Gibbs' solvation energy 
ratio. 
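
This passage does not give the formula behind the 3D-1D scores. One common convention for such scores is a log-odds ratio, ln[P(amino acid | environment) / P(amino acid)]; the sketch below assumes that convention, and the function name and input layout are hypothetical.

```python
import math
from collections import Counter


def scores_3d1d(observations):
    """Return log-odds scores keyed by (amino_acid, environment).

    observations: iterable of (amino_acid, environment) pairs, one per
    residue position in the database. The log-odds form is an assumption,
    not a formula stated in the text. Pairs never observed are omitted.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    env_counts = Counter(env for _, env in observations)
    total = len(observations)
    scores = {}
    for (aa, env), n in pair_counts.items():
        p_aa_given_env = n / env_counts[env]
        p_aa = aa_counts[aa] / total
        scores[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return scores
```

Under this convention a positive score means the amino acid is found in that environment more often than its overall frequency would predict, and a negative score means less often.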

[0036] Figure 11 shows fold-recognition results for 81 protein targets using a
scoring matrix composed of thermodynamic information from protein structures. The
horizontal axis represents the percentile ranking of the score against the target structure for
the sequence corresponding to the target structure. For example, the sequence corresponding
to the target cold-shock protein (PDB 1mjc) received the 157th highest score of 3858
sequences against the cold-shock protein thermodynamic profile. This result placed the
sequence for the cold-shock protein in the 5th percentile bin in Figure 11. When aligned with
their respective thermodynamic profiles, the majority (44/81) of sequences scored better than 
99% of the 3858 sequences in the database. 
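
The percentile arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and ranks are assumed to count from 1 for the highest score.

```python
def rank_of(target_score, scores):
    """Return the 1-based rank of target_score among scores (1 = highest)."""
    return 1 + sum(1 for s in scores if s > target_score)


def percentile_rank(rank, n_sequences):
    """Return the rank as a percentage of the database size."""
    return 100.0 * rank / n_sequences
```

With the figures quoted above, `percentile_rank(157, 3858)` is about 4.07, which lies between 4 and 5 and is consistent with the 5th percentile bin.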

[0037] Figure 12 shows fold-recognition results for 12 all-beta protein targets 
using a scoring matrix composed of thermodynamic information from 31 all-alpha protein 
structures. The horizontal axis represents the percentile ranking of the score against the 
target structure for the sequence corresponding to the target structure. For example, the 
sequence corresponding to the all-beta target tendamistat (PDB 1hoe) received the 26th
highest score of 3858 sequences against the tendamistat thermodynamic profile. This result
placed the tendamistat sequence in the 5th percentile bin in Figure 5. All 12 sequences
corresponding to beta targets scored better against their respective targets than 90% of the 
3858 sequences in the database. 

DETAILED DESCRIPTION OF THE INVENTION 

[0038] It is readily apparent to one skilled in the art that various embodiments 
and modifications may be made to the invention disclosed in this Application without 
departing from the scope and spirit of the invention. 

[0039] As used herein in the specification, "a" or "an" may mean one or more. As
used herein in the claim(s), when used in conjunction with the word "comprising", the words 
"a" or "an" may mean one or more than one. As used herein "another" may mean at least a 
second or more. 

[0040] The term "conformation" as used herein refers to various
nonsuperimposable three-dimensional arrangements of atoms that are interconvertible 
without breaking covalent bonds.

[0041] The term "configuration" as used herein refers to different
conformations of a protein molecule that have the same chirality of atoms. 

[0042] The term "database" as used herein refers to a collection of data 
arranged for ease of retrieval by a computer. Data is also stored in a manner where it is easily 
compared to existing data sets. 

[0043] The term "enthalpy" as used herein refers to a thermodynamic state or
environment in which the enthalpy of internal interactions and the hydrophobic entropy
change favor protein folding; thus, enthalpy is a thermodynamic component of the
thermodynamic stability of globular proteins. As used herein, enthalpy is expressed as a
ratio of the polar and apolar contributions:

$$H_{ratio,j} = \frac{\Delta H_{pol,j}}{\Delta H_{apol,j}}$$

[0044] The term "entropy" as used herein refers to a thermodynamic state or
environment in which the conformational entropy change works against folding of proteins.
As used herein, entropy is expressed as a ratio of the conformational entropy to the total
solvation free energy:

$$S_{ratio,j} = \frac{\Delta S_{conf,j}}{\Delta G_{solv,j}}$$

[0045] The term "globular protein" as used herein refers to proteins whose
polypeptide chains are folded into compact structures. The compact structures are
unlike the extended filamentous forms of fibrous proteins. A skilled artisan realizes that
globular proteins have tertiary structures that comprise the secondary structure elements,
e.g., helices, β sheets, or nonregular regions, folded in specific arrangements. An example of
a globular protein includes, but is not limited to, myoglobin.

[0046] The term "peptide" as used herein refers to a chain of amino acids with a
defined sequence whose physical properties are those expected from the sum of its amino
acid residues and that has no fixed three-dimensional structure.

[0047] The term "polyamino acids" as used herein refers to random sequences 
of varying lengths generally resulting from nonspecific polymerization of one or more amino 
acids.

[0048] The term "protein" as used herein refers to a chain of amino acids
usually of defined sequence, length and three-dimensional structure. Because the
polymerization reaction that produces a protein results in the loss of one molecule of water
from each amino acid, proteins are often said to be composed of amino acid residues. Natural
protein molecules may contain as many as 20 different types of amino acid residues, each of
which contains a distinctive side chain.

[0049] The term "protein fold" as used herein refers to an organization of a 
protein to form a structure which constrains individual amino acids to a specific location 
relative to the other amino acids in the sequence. One of skill in the art realizes that this type 
of organization of a protein comprises secondary, tertiary and quaternary structures.

[0050] The term "thermodynamic environment" as used herein refers to the 
various thermodynamic components that contribute to the folding process of a protein. For 
example, stability, entropy and enthalpy thermodynamic environments contribute to the 
folding of a protein. One skilled in the art realizes that the terms "thermodynamic 
environment", "thermodynamic classification" or "thermodynamic component" are 
interchangeable. 

[0051] There is a hierarchy of protein structure. The primary structure is the 
covalent structure, which comprises the particular sequence of amino acid residues in a 
protein and any post-translational covalent modifications that may occur. The secondary
structure is the local conformation of the polypeptide backbone. The helices, sheets, and 
turns of a protein's secondary structure pack together to produce the three-dimensional 
structure of the protein. The three-dimensional structure of many proteins may be 
characterized as having internal surfaces (directed away from the aqueous environment in 
which the protein is normally found) and external surfaces (which are in close proximity to 
the aqueous environment). Through the study of many natural proteins, researchers have 
discovered that hydrophobic residues (such as tryptophan, phenylalanine, tyrosine, leucine, 
isoleucine, valine or methionine) are most frequently found on the internal surface of protein 
molecules. In contrast, hydrophilic residues (such as aspartate, asparagine, glutamate,
glutamine, lysine, arginine, histidine, serine, threonine, glycine, and proline) are most 
frequently found on the external protein surface. The amino acids alanine, glycine, serine 
and threonine are encountered with equal frequency on both the internal and external protein 
surfaces.

[0052] An embodiment of the present invention is a protein database
comprising nonhomologous proteins having known residue-specific free energies of folding 
of the proteins. 

[0053] One of skill in the art is cognizant that the properties of proteins are 
governed by their potential energy surfaces. Proteins exist in a dynamic equilibrium between 
a folded, ordered state and an unfolded, disordered state. This equilibrium in part reflects the 
interactions between the side chains of amino acid residues, which tend to stabilize the 
protein's structure, and, on the other hand, those thermodynamic forces which tend to 
promote the randomization of the molecule. 

[0054] The present invention utilizes a computational method comprising the 
step of determining a stability constant from the ratio of the summed probability of all states 
in the ensemble in which a residue j is in a folded conformation to the summed probability of 
all states in which j is in an unfolded conformation according to the equation,

$$K_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}}$$

[0055] One of skill in the art is cognizant that although the stability constant is 
defined for each position, the value obtained at each residue is not the energetic contribution 
of that residue. The stability constant is a property of the ensemble as a whole. For each 
partially unfolded microstate, the energy difference between it and the fully folded reference 
state is determined by the energetic contributions of all amino acids comprising the folding 
units that are unfolded in each microstate, plus the energetic contributions associated with 
exposing additional (complementary) surface area on the protein (Figure 1B). The stability
constant thus provides the average thermodynamic environment of each residue, wherein 
surface area, polarity, and packing are implicitly considered. Thus, the stability constant 
provides a thermodynamic metric wherein each of these static structural properties is 
weighted according to its energetic impact at each position. 

[0056] The stability constants for the residues are arranged into three 
classifications of stability selected from the group consisting of high, medium and low. 
Specifically, the residues in the high stability classification comprise phenylalanine,
tryptophan and tyrosine. The residues in the low stability classification comprise glycine
and proline. The residues in the medium stability classification comprise asparagine and
glutamic acid.

[0057] In the present invention, the classifications of high, medium and low are
determined based upon inspection of the ln κ_f value for each protein in the selected database.
Thus one of skill in the art is cognizant that these classifications are relative and may vary
depending upon the proteins that are selected for the database. One of skill in the art
recognizes that these classifications can be subclassified by a variety of other parameters, for
example, but not limited to, enthalpy and entropy. Thus, any given position in a structure may
be represented by two or more parameters, for example, but not limited to, low stability (ln κ_f)
and high enthalpy. Yet further, additional parameters can be used to further divide the
categories of enthalpy and entropy, for example, but not limited to, conformational entropy,
solvent entropy, polar enthalpy, apolar enthalpy, polar entropy or apolar entropy. Thus, any
given position in a structure may have a description such as, but not limited to, low stability,
high apolar enthalpy, high polar enthalpy, medium conformational entropy and high apolar
entropy. One of skill in the art realizes that these classifications allow for better resolution
and, consequently, better performance in identifying the correct protein fold for a given
protein sequence or a portion of a given protein sequence. Further, one of skill in the art is
cognizant that a protein fold refers to the secondary structure of the protein, which includes
sheets, helices and turns.

[0058] Another specific embodiment of the present invention comprises that the 
stability constants for the residues are arranged into at least one of the three thermodynamic 
classification groups selected from the group consisting of stability, enthalpy, and entropy. 

[0059] Specific embodiments of the present invention provide that the
database comprises globular and nonhomologous proteins. A skilled artisan is cognizant that
globular proteins are used to study protein folding. It is contemplated that the computational
method of the present invention may be used for a variety of globular proteins including but
not limited to glucocorticoid receptor like DNA binding domain, histone, acyl carrier protein
like, anti LPS factor/RecA domain, lambda repressor like DNA binding domains, EF hand
like, insulin like, bacterial Ig/albumin binding, barrel sandwich hybrid, P-loop containing NTP
hydrolases, RING finger domain C3HC4, crambin like, ribosomal protein L7/12 C-terminal
fragment, cytochrome c, SAM domain like, KH domain, RNA polymerase subunit H,
beta-grasp (ubiquitin-like), rubredoxin like, HiPiP, anaphylotoxins (complement system),
ferredoxin like, OB fold, midkine, HMG box, saposin, HPr proteins, knottins, HIV-1 Nef
protein fragments, thermostable subdomain from chicken villin, S15/NS1 RNA binding
domain, SH3 like barrel, DNA topoisomerase I domain, IL8 like, de novo designed single
chain 3 helix bundle, alpha amylase inhibitor tendamistat, CI2 family of serine protease
inhibitors, protozoan pheromone proteins, ConA like lectins/glucanases,
ovomucoid/PCI-1 like inhibitors, beta clip, snake toxin like and BPTI like. Other globular
proteins may be selected from the Protein Data Bank.

[0060] One of skill in the art also recognizes that the present invention is not 
limited to small molecular proteins. A skilled artisan is cognizant that the computational 
method used in the present invention can be used on larger proteins. Thus, there is not a size 
limit to the proteins that can be used in the present invention. 



[0061] Another embodiment of the present invention is a method of developing
a protein database comprising the steps of: inputting high resolution structures of proteins;
generating an ensemble of incrementally different conformations by combinatorial unfolding
of a set of predefined folding units in all possible combinations of each protein; determining
the probability of each said conformational state; calculating the residue-specific free energy
of each conformational state; and classifying a stability constant into at least one
thermodynamic environment selected from the group consisting of stability, enthalpy, and
entropy.

[0062] In specific embodiments, the generating step comprises dividing the 
proteins into folding units by placing a block of windows over the entire sequence of the 
protein and sliding the block of windows one residue at a time. 

[0063] One of skill in the art is cognizant that the division of a protein into a
given number of folding units is a partition. Thus, to maximize the number of partially
folded states, different partitions are used in the analysis. The partitions can be defined by
placing a block of windows over the entire sequence of the protein. The folding units are
defined by the location of the windows irrespective of whether they coincide with specific
secondary structure elements. By sliding the entire block of windows one residue at a time,
different partitions of the protein are obtained. For two consecutive partitions, the first and
last amino acids of each folding unit are shifted by one residue. This procedure is repeated
until the entire set of partitions has been exhausted. In specific embodiments, windows of 5
or 8 amino acid residues are used. One of skill in the art realizes that approximately 10^5
partially folded conformations can be generated using the COREX algorithm. This value can
be altered by increasing or decreasing the window size and the size of the protein. For
example, for the proteins λ6-85, chymotrypsin inhibitor 2 and barnase, window sizes of 5, 5,
and 8 amino acid residues result in 2.6 × 10^5, 0.4 × 10^5, and 1.1 × 10^5 partially folded
conformations, respectively.
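
By way of illustration only, the sliding-window partitioning described above can be sketched in Python as follows. The function names `sliding_partitions` and `count_partial_states` are hypothetical, and the treatment of the chain ends (units are simply truncated) is an assumption; the handling of short terminal units in the COREX algorithm itself is not reproduced here.

```python
from typing import List, Tuple

Partition = List[Tuple[int, int]]  # (start, end) per folding unit, 0-based, end exclusive


def sliding_partitions(n_residues: int, window: int) -> List[Partition]:
    """Return one partition of the chain per window offset.

    Assumption: units at the chain ends are truncated rather than merged
    with their neighbors.
    """
    partitions = []
    for offset in range(window):
        # Unit boundaries: chain start, every `window` residues from `offset`, chain end.
        cuts = sorted(set([0] + list(range(offset, n_residues, window)) + [n_residues]))
        partitions.append(list(zip(cuts[:-1], cuts[1:])))
    return partitions


def count_partial_states(partition: Partition) -> int:
    """Each unit is folded or unfolded; exclude the fully folded and fully unfolded states."""
    return 2 ** len(partition) - 2


parts = sliding_partitions(n_residues=12, window=5)
for p in parts:
    print(p, count_partial_states(p))
```

Summing `count_partial_states` over all partitions gives the size of the partially folded ensemble under these simplifying assumptions; it is not expected to reproduce the ensemble sizes quoted above exactly.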

[0064] In further embodiments, the determining step comprises determining the
free energy of each of the conformational states in the ensemble; determining the Boltzmann
weight [K_i = exp(-ΔG_i/RT)] of each state; and determining the probability of each state using
the equation,

$$P_i = \frac{K_i}{\sum_i K_i}$$



[0065] Yet further, the calculating step comprises determining the energy
difference between all microscopic states in which a particular residue is folded and all such
states in which it is unfolded using the equation,

$$\Delta G_{f,j} = -RT \ln \kappa_{f,j}$$

[0066] One of skill in the art is aware that the COREX algorithm generates a
large number of partially folded states of a protein from the high resolution crystallographic
or NMR structure (Hilser & Freire, 1996; Hilser & Freire, 1997; and Hilser et al., 1997). In
this algorithm, the high resolution structure is used as a template to approximate the ensemble
of partially folded states of a protein. Thus, the protein is considered to be composed of
different folding units. The partially folded states are generated by folding and unfolding
these units in all possible combinations. There are two basic assumptions in the COREX
algorithm: (1) the folded regions in partially folded states are native-like; and (2) the unfolded
regions are assumed to be devoid of structure or lacking structure. Thermodynamic
quantities, e.g., ΔH, ΔS, ΔCp, and ΔG, the partition function and the probability of each state (P_i) are
evaluated using an empirical parameterization of the energetics (Murphy & Freire, 1992;
Gomez et al., 1995; Hilser et al., 1996; Lee et al., 1994; D'Aquino et al., 1996; and Luque et
al., 1996).

[0067] Yet further, a skilled artisan is cognizant that the residue-specific
equilibrium constants provide quantitative agreement with those obtained experimentally from amide
hydrogen exchange experiments, e.g., hydrogen protection factors (Hilser & Freire, 1996;
Hilser & Freire, 1997; and Hilser et al., 1997).

[0068] One of skill in the art realizes that while the residue stability constants 
are purely thermodynamic quantities defined for all residues, the protection factors also 
contain non-thermodynamic contributions and are defined for a subset of residues. 

[0069] Another embodiment of the present invention is a method of identifying 
a protein fold comprising determining the distribution of amino acid residues in different 
thermodynamic environments corresponding to a known protein structure. More particularly, 
determining the distribution of amino acid residues comprises constructing scoring matrices 
derived from thermodynamic information. Specifically, the scoring matrices are derived from
COREX thermodynamic information, such as stability, enthalpy, and entropy. Thus, 
COREX-derived thermodynamic descriptors can be used to identify sequences that 
correspond to a specific fold. 

[0070] A skilled artisan recognizes that the COREX algorithm provides a 
means of estimating the energetic variability in the native state of proteins, and uses this 
information to illuminate the relation between amino acid sequence and protein structure. 
Therefore, the thermodynamic information obtained by the COREX algorithm represents a 
fundamental descriptor of proteins that transcends secondary structure classifications. 

[0071] Protein folds can be considered as one of the most basic molecular parts.
A skilled artisan recognizes that the properties related to protein folds can be divided into two
parts, intrinsic and extrinsic. The intrinsic properties relate to an individual fold, e.g., its
sequence, three-dimensional structure and function. Extrinsic properties relate to a fold in
the context of all other folds, e.g., its occurrence in many genomes and expression level in
relation to that for other folds.

[0072] Further, one of skill in the art realizes that other methods well known in
the art can be used to develop protein databases, for example, but not limited to, the Monte Carlo
sampling method. The Monte Carlo sampling method is well known and used in the art (Pan
et al., 2000).

EXAMPLES



[0073] The following examples are included to demonstrate preferred 
embodiments of the invention. It should be appreciated by those skilled in the art that the 
techniques disclosed in the examples which follow represent techniques discovered by the 
inventor to function well in the practice of the invention, and thus can be considered to 
constitute preferred modes for its practice. However, those of skill in the art should, in light 
of the present disclosure, appreciate that many changes can be made in the specific 
embodiments which are disclosed and still obtain a like or similar result without departing 
from the concept, spirit and scope of the invention. 

Example 1 
Selection of proteins used in dataset 

[0074] A database of 44 proteins, 2922 residues total (Table 1), was selected 
from the Protein Data Bank on the basis of biological and computational criteria. The two 
biological criteria were that the proteins be globular and nonhomologous with every other 
member of the set as ascertained by SCOP (Murzin et al., 1995). The first computational 
criterion was that the proteins be small (less than about 90 residues), because the CPU time 
and data storage needs of an exhaustive COREX calculation increased exponentially with the 
chain length. The second computational criterion was that the structures be mostly devoid of 
ligands, metals, or cofactors, as the COREX energy function was not parameterized to 
account for the energetic contributions of non-protein atoms. The database was comprised of 
24 x-ray structures, whose resolution ranged from 2.60 to 1.00 Å (median value of 1.65 Å).
Twenty NMR structures completed the database. An independent database of 50 proteins 
(3304 residues total) that were not included in the above set, was created from the PDBSelect 
database (Hobohm & Sander, 1996). This second database was used as a control to check the 
results obtained from the first database, as shown in Figure 7. 



[Table 1: proteins used in the database (table body not recoverable from the source text)]


Example 2 
Computational Details 

[0075] The database of 44 nonhomologous proteins (Table 1) was analyzed 
using the COREX algorithm. The COREX algorithm (Hilser & Freire, 1996) was run with a 
window size of five residues on each protein in the database. The minimum window size was 
set to four, and the simulated temperature was 25 °C. 

[0076] Briefly, COREX generated an ensemble of partially unfolded 
microstates using the high-resolution structure of each protein as a template (Hilser & Freire, 
1996). This was facilitated by combinatorially unfolding a predefined set of folding units 
(i.e., residues 1-5 are in the first folding unit, residues 6-10 are in the second folding unit,
etc.). By means of an incremental shift in the boundaries of the folding units, an exhaustive 
enumeration of the partially unfolded species was achieved for a given folding unit size. The 
entire procedure is shown schematically in Figure 1A for ovomucoid third domain (OM3), 
one of the proteins in the database (PDB accession code 2ovo). 

[0077] For each microstate i in the ensemble, the Gibbs free energy was
calculated from the surface area-based parameterization described previously (D'Aquino,
1996; Gomez, 1995; Xie, 1994; Baldwin, 1986; Lee, 1994; Habermann, 1996). The
Boltzmann weight of each microstate [i.e., K_i = exp(-ΔG_i/RT)] was used to calculate its
probability:

$$P_i = \frac{K_i}{\sum_i K_i} \tag{1}$$

[0078] where the summation in the denominator is over all microstates. From
the probabilities calculated in Equation 1, an important statistical descriptor of the
equilibrium was evaluated for each residue in the protein. Defined as the residue stability
constant, κ_{f,j}, this quantity was the ratio of the summed probability of all states in the
ensemble in which a particular residue j was in a folded conformation (ΣP_{f,j}) to the summed
probability of all states in which j was in an unfolded conformation (ΣP_{nf,j}):

$$\kappa_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}} \tag{2}$$

[0079] From the stability constant, a residue-specific free energy was written
as:

$$\Delta G_{f,j} = -RT \ln \kappa_{f,j} \tag{3}$$

[0080] Equation 3 reflects the energy difference between all microscopic
states in which a particular residue was folded and all such states in which it was unfolded.
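
As a non-limiting sketch, Equations 1 through 3 can be evaluated for a toy ensemble as shown below. The function names, the two-residue ensemble, and the free energy values are hypothetical and chosen only to make the arithmetic visible; the surface area-based energy function itself is not reproduced.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def state_probabilities(delta_g, temperature=298.15):
    """Equation 1: P_i = K_i / sum(K_i), with K_i = exp(-dG_i / RT)."""
    rt = R_KCAL * temperature
    weights = [math.exp(-g / rt) for g in delta_g]
    total = sum(weights)
    return [w / total for w in weights]


def residue_stability(probabilities, folded_masks, temperature=298.15):
    """Equations 2 and 3 for every residue.

    folded_masks[i][j] is True when residue j is folded in microstate i.
    The ensemble must contain at least one state in which each residue is
    unfolded, otherwise the ratio is undefined.
    """
    rt = R_KCAL * temperature
    ln_kappa, delta_g_f = [], []
    for j in range(len(folded_masks[0])):
        p_folded = sum(p for p, m in zip(probabilities, folded_masks) if m[j])
        p_unfolded = sum(p for p, m in zip(probabilities, folded_masks) if not m[j])
        value = math.log(p_folded / p_unfolded)
        ln_kappa.append(value)
        delta_g_f.append(-rt * value)
    return ln_kappa, delta_g_f


# Hypothetical two-residue ensemble: all four folded/unfolded combinations,
# with free energies (kcal/mol) relative to the fully folded state.
masks = [(True, True), (True, False), (False, True), (False, False)]
delta_g = [0.0, 1.0, 2.0, 4.0]
probs = state_probabilities(delta_g)
ln_kappa, dg_f = residue_stability(probs, masks)
print(ln_kappa, dg_f)
```

In this toy case the first residue is unfolded only in the two least probable states, so it receives the larger stability constant, which illustrates that the constant reflects the ensemble rather than any single residue's own energetic contribution.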

[0081] The Gibbs energy for each microstate i relative to the fully folded 
structure was calculated using Equation 4: 

$$\Delta G_i = \Delta H_{i,\mathrm{solvation}} - T\left(\Delta S_{i,\mathrm{solvation}} + W\,\Delta S_{i,\mathrm{conformational}}\right) \tag{4}$$

[0082] where the calorimetric enthalpy and entropy of solvation were
parameterized from polar and apolar surface exposure, and the conformational entropy was
determined as described previously (Hilser & Freire, 1996). The maximum stability for each
protein was normalized to a common arbitrary value of approximately 6.2 kcal/mol (max ln κ_f
= 10.4) by adjusting its conformational entropy factor, W, in Equation 4. The average
entropy factor required for the normalization was 0.81 ± 0.19 (mean ± s.d.) over the 44
proteins. It was an empirical observation that adjustment of a stable protein's conformational
entropy factor did not change the relative patterns of high and low stability regions in the
structure.
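
The correspondence between the normalization target and the maximum ln κ_f follows from Equation 3 at the simulated temperature of 25 °C, and can be checked with the short calculation below. The variable names are illustrative, and the value used for the gas constant is the usual 1.987 cal/(mol K).

```python
R_KCAL = 1.987e-3          # gas constant in kcal/(mol*K)
T_KELVIN = 25.0 + 273.15   # simulated temperature of 25 degrees C
MAX_LN_KAPPA = 10.4        # normalization target for the most stable residue

# Magnitude of the residue-specific free energy from Equation 3.
max_stability = R_KCAL * T_KELVIN * MAX_LN_KAPPA
print(round(max_stability, 2))  # 6.16, i.e. approximately 6.2 kcal/mol
```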

Example 3 

Comparison of Residue Stability Constant to 
Hydrogen Exchange Protection Factors 

[0083] Prediction of the hydrogen exchange protection factors of the residues
that exchange protons was performed by calculation of the ensemble P_{f,j} and P_{f,xc,j} values.

[0084] Briefly, the protection factor for any given residue j was defined as the
ratio of the sum of the probabilities of the states in which residue j was closed, to the sum of
the probabilities of the states in which residue j was open:

$$PF_j = \frac{\sum P_{\mathrm{closed},j}}{\sum P_{\mathrm{open},j}} \tag{5}$$

[0085] The statistical definition of the protection factors has the same form as
that of the stability constants (Equation 2) and was expressed in terms of the folding
probabilities as follows:

$$PF_j = \frac{P_{f,j} - P_{f,xc,j}}{P_{nf,j} + P_{f,xc,j}} \tag{6}$$

[0086] The correction term P_{f,xc,j} was the sum of the probabilities of all states in
which residue j was folded, yet exchange competent.
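
A minimal sketch of the protection factor calculation is given below, assuming the summed probabilities for residue j have already been obtained from the ensemble. The function name and the numerical inputs are hypothetical.

```python
def protection_factor(p_folded, p_unfolded, p_folded_exchange_competent=0.0):
    """Ratio of closed to open probability for one residue.

    States in which the residue is folded yet exchange competent are moved
    from the closed (numerator) to the open (denominator) population.
    """
    closed = p_folded - p_folded_exchange_competent
    opened = p_unfolded + p_folded_exchange_competent
    return closed / opened


# Without the correction term the protection factor equals the stability constant.
print(protection_factor(0.99, 0.01))
# A small exchange-competent folded population lowers the protection factor.
print(protection_factor(0.99, 0.01, 0.01))
```

The second call shows why protection factors can fall below the corresponding stability constants: any folded but exchange-competent population counts as open.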

[0087] Figure 2 shows the comparison of hydrogen exchange protection factors 
predicted from COREX data with experimental values for OM3. The agreement in the 
location and relative magnitude of the protection factors with the stability constants for this 
and other proteins suggested that the calculated native state ensemble provided a good 
description of the actual ensemble (Hilser & Freire, 1996). It naturally follows that the 
residue stability constants of a particular protein provided a good description of the 
thermodynamic environment of each residue in that structure. 

[0088] Further inspection of Figure 2 revealed another important feature in the 
pattern of residue stability constants. Namely, the stability constants varied significantly 
across a given secondary structural element, as observed for alpha helix 1 of OM3. The 
protection factors (and stability constants) were high at the N-terminal region of helix 1, but
decreased over the length of the helix. This indicated that secondary structure, or other 
structural classifications, do not obligatorily coincide with thermodynamic classifications. 
This result has potentially important consequences for cataloging propensities of amino acids 
in different environments. For example, in OM3 two threonine residues were located in 
different structural environments; Thr 47 was part of the loop that follows alpha helix 1, 
while Thr 49 was part of beta strand 3. In spite of the different structural environments for 
the two threonine residues, the stability constants and, more importantly, the experimental 
protection factors demonstrated that both residues, to a first approximation, share the same 
thermodynamic environment. 

Example 4 
Binning of Residue Stability Constants 

[0089] Inspection of each protein's ln κ_f data indicated that there were three
stability classes: high, medium, and low stability. The cutoffs for each stability class were
adjusted so that an approximately equal number of residues in the database fell in each class
(Table 2). The low stability category was defined as ln κ_f ≤ 3.99, the medium stability
category was defined as 3.99 < ln κ_f ≤ 7.14, and the high stability category was defined as
ln κ_f > 7.14. Statistics of amino acid type as a function of each of these stability categories
were tabulated (Table 2), and normalized histograms of these numbers are shown in Figure
3A-Figure 3T. In addition, these values (minus the values for a given target protein) were
used to compute the ln κ_f scoring matrices.
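
The binning can be sketched as follows, using the cutoffs stated above. The function names and the four sample residues are hypothetical and serve only to show how boundary values are assigned.

```python
from collections import Counter

LOW_MAX = 3.99     # ln kappa_f <= 3.99 is low stability
MEDIUM_MAX = 7.14  # 3.99 < ln kappa_f <= 7.14 is medium stability


def stability_class(ln_kappa):
    """Map a residue's ln kappa_f onto the L, M, or H stability class."""
    if ln_kappa <= LOW_MAX:
        return "L"
    if ln_kappa <= MEDIUM_MAX:
        return "M"
    return "H"


def tabulate(residue_types, ln_kappas):
    """Count residues by (one-letter amino acid type, stability class)."""
    return Counter(
        (aa, stability_class(value)) for aa, value in zip(residue_types, ln_kappas)
    )


counts = tabulate("GFPW", [2.1, 8.3, 3.99, 7.15])
print(counts)
```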

[0090] Striking asymmetries were often observed for the histograms of certain
amino acids across the three stability environments, and these asymmetries were well outside
the standard deviation of the average of three random data sets. For example, the aromatic
amino acids Phe, Trp, and Tyr were mostly found in high stability environments, while Gly
and Pro were overwhelmingly found in low stability environments. In contrast, other
residues such as Ala, Met, and Ser exhibited distributions that did not significantly differ
from randomized data.



[0091] Although the acidic residues Asp and Glu shared a slight tendency to be
found in medium stability environments, it was observed that several amino acid pairs having
nominally similar chemical characteristics partitioned differently in the stability environments.
For example, the basic residues Arg and Lys exhibited opposite stability characteristics: the
counts for Arg increased as the stability class increased, but the counts for Lys decreased as a
function of stability class. While Asn was found less often in high stability environments,
Gln was found more often in them. Although the distribution for Ser did not differ
significantly from the randomized data, Thr occurred more often in low stability
environments and less often in high stability environments. Somewhat surprisingly, the
aliphatic amino acids Ile, Leu, and Val did not show a general pattern, except perhaps a slight
disfavoring of low stability environments.

Example 5 

Calculation of Average Native State Side Chain Surface Area Exposure

[0092] Average side chain surface area exposure of residue j over a
window size of five residues, ASA_{average,j}, was calculated using Equation 7:

$$ASA_{\mathrm{average},j} = \frac{1}{5}\sum_{k=j-2}^{j+2} ASA_{\mathrm{native},k} \tag{7}$$

[0093] Because Equation 7 was undefined for the first and last two residues in
each protein, these four residues were ignored in the binning. The cutoffs for each side chain
area class were adjusted so that an approximately equal number of residues fell in each class.
The low exposure category was defined as ASA_{average,j} ≤ 43.31 Å², the medium exposure
category was defined as 43.31 Å² < ASA_{average,j} ≤ 59.86 Å², and the high exposure category
was defined as ASA_{average,j} > 59.86 Å².
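
For illustration, a five-residue window average that leaves the first and last two positions undefined can be written as below. The function name is hypothetical and the input values are arbitrary; the centered window is an assumption consistent with the statement that two residues at each end are undefined.

```python
def window_average(asa_native, window=5):
    """Centered moving average of per-residue side chain exposure.

    Positions without a full window (the first and last window // 2
    residues) are returned as None and are skipped in the binning.
    """
    half = window // 2
    averaged = [None] * len(asa_native)
    for j in range(half, len(asa_native) - half):
        averaged[j] = sum(asa_native[j - half : j + half + 1]) / window
    return averaged


print(window_average([10, 20, 30, 40, 50, 60]))
# [None, None, 30.0, 40.0, None, None]
```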

[0094] As shown in Figure 4, frequencies of amino acids found in COREX
stability environments were not correlated to frequencies of amino acids in exposed surface
area environments. This was important as it suggested that the thermodynamic information
calculated by the COREX algorithm was not simply monitoring a static property of the
structure, but instead was capturing a property of the native state ensemble as a whole.

Example 6
Random Data Sets

[0095] For comparison to the COREX and DSSP data sets from the 44 nonhomologous
proteins in the database, control data sets were constructed by randomizing (i.e.,
shuffling) the calculated stability and the secondary structure data. The random data sets
therefore contained the same amino acid composition, counts of high, medium, and low
stabilities, and types of secondary structure, as the real data sets. However, any correlation
between residue type and stability class or secondary structural class was presumably destroyed by
randomization. To assess internal variability of the data due to differing numbers of counts
of each residue type, the results from three randomized data sets were averaged and standard
deviations calculated; these data are plotted in Figure 3A-Figure 3T.
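
One way to construct such a control is sketched below: the class labels are shuffled across residues, so composition and class totals are preserved while any residue-class pairing is lost. The function names, the seeds, and the eight-residue input are hypothetical.

```python
import random
import statistics
from collections import Counter


def shuffled_counts(residues, classes, seed):
    """Count (residue type, class) pairs after shuffling the class labels."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)  # destroys any residue/class correlation
    return Counter(zip(residues, shuffled))


def random_baseline(residues, classes, seeds=(0, 1, 2)):
    """Mean and sample standard deviation of each pair count over the shuffles."""
    runs = [shuffled_counts(residues, classes, s) for s in seeds]
    keys = {(r, c) for r in set(residues) for c in set(classes)}
    return {
        key: (
            statistics.mean(run[key] for run in runs),
            statistics.stdev(run[key] for run in runs),
        )
        for key in keys
    }


residues = list("GGGGFFFF")
classes = list("LLLLHHHH")
baseline = random_baseline(residues, classes)
print(baseline)
```

Observed counts that fall well outside the mean plus or minus the standard deviation of this baseline are the asymmetries of interest.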

Example 7 
Construction of Scoring Matrices 

[0096] The scoring matrices were calculated as log-odds probabilities of finding
residue type j in structural environment k, as described below and in (Bowie et al., 1991).
The matrix score, S_{j,k}, was defined as:

$$S_{j,k} = \ln\frac{P_{j|k}}{P_k} \tag{8}$$

[0097] In Equation 8, P_{j|k} was the probability of finding a residue of type j in
stability class k (i.e., number of counts of residue type j in stability class k divided by the total
number of counts of residue type j), and P_k was the probability of finding any residue in the
database in stability environment k (i.e., number of residues in stability class k, regardless of
amino acid type, divided by the total number of residues in the entire database, regardless of
amino acid type). The structural environment was described by either COREX stability
information (high, medium, or low ln κ_f), or DSSP secondary structure (alpha, beta, or other)
as given in the target's PDB entry. The fold recognition target was removed from the
database, and the remaining 43 proteins were used to calculate the scores; therefore,
information about the target was never included in the scoring matrix. The values in Tables
3A and 3B are the average ± standard deviation of all 44 individual scoring matrices.

[0098] The scoring matrices derived from COREX stability and secondary
structure, averaged over all 44 target proteins, are shown in Tables 3A and 3B, respectively.
The stability matrix scores faithfully reflected the histograms shown in Figure 3A-Figure 3T;
for example, Gly and Pro scored unfavorably in high stability environments but scored
favorably in low stability environments. Similarly, the secondary structure matrix scores
followed intuitive notions of secondary structure propensity; for example, Ala scored
positively in helical environments, the aromatics scored positively in beta environments, and
Gly and Pro scored negatively in both alpha and beta environments. The standard deviations
in both matrices were generally small as compared to the magnitude of the scores, suggesting
that the scores were not affected by the removal of any one protein from the database.
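
The log-odds score of Equation 8 can be sketched as follows. The function name and the six-residue input are hypothetical; residue and class pairs that never occur are simply left out of the result, which is an assumption made here because the logarithm of zero is undefined. To mimic the removal of the target, the caller would pass only the residues of the remaining proteins.

```python
import math
from collections import Counter


def scoring_matrix(residues, classes):
    """Log-odds score for every observed (residue type, class) pair."""
    pair_counts = Counter(zip(residues, classes))
    residue_counts = Counter(residues)
    class_counts = Counter(classes)
    total = len(residues)
    scores = {}
    for (res, cls), n in pair_counts.items():
        p_res_cls = n / residue_counts[res]  # counts of type j in class k / counts of type j
        p_cls = class_counts[cls] / total    # residues in class k / all residues
        scores[(res, cls)] = math.log(p_res_cls / p_cls)
    return scores


scores = scoring_matrix(list("GGGFFF"), list("LLHHHL"))
print(scores)
```

A positive score means the residue type is found in that environment more often than a residue chosen at random; a negative score means it is found there less often.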

Example 8 
Fold-Recognition Details 

[0099] Fold-recognition experiments were based on the profile method
pioneered by Eisenberg and co-workers (Gribskov et al., 1987; Bowie et al., 1991).

[0100] Briefly, the method characterized each residue position of a target
protein in terms of a structural environment score derived from analysis of a database of
known structures. The resulting profile of the target protein was then optimally aligned to
each member of a library of amino acid sequences by maximizing the score between the
sequence and the profile. Two structural environment scoring schemes were developed: one
based on calculated COREX stability, and one based on DSSP secondary structure (Kabsch
& Sander, 1983) as contained in each target protein's PDB file. Each scoring scheme had
three dimensions as a function of the 20 amino acids: high, medium, and low stability for
COREX scoring, or alpha, beta, and other for secondary structure scoring. Two alignment
algorithms were used: a local scheme (Smith & Waterman, 1981) as implemented in the
PROFILESEARCH software package (Bowie et al., 1991), and a global scheme. The global
alignment scheme simply paired the first residue of an amino acid sequence with the first
position of a target profile, with no allowance for gaps. This scheme was possible because
the amino acid sequence lists against which the targets were threaded only included
sequences of identical length to each target corresponding to monomeric structures from the
PDB. The total number of identical length sequences for each target ranged from 6 to 35,
with an average of 19 ± 8 sequences per target (Table 1). No attempt was made to optimize
the gap opening and extension penalties for the local algorithm; in all cases these were the
defaults given in the PROFILESEARCH package, 0.1 and 0.05, respectively.
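
The global, ungapped scheme reduces to summing matrix scores position by position and ranking the correct sequence among the identical-length alternatives. A minimal sketch is given below; the function names, the toy scoring values, and the three-residue sequences are hypothetical, and the local algorithm of the PROFILESEARCH package is not reproduced.

```python
def global_score(sequence, profile, scores, missing=0.0):
    """Sum of matrix scores with residue i paired to profile position i (no gaps)."""
    if len(sequence) != len(profile):
        raise ValueError("global scheme requires sequences of identical length")
    return sum(scores.get((aa, env), missing) for aa, env in zip(sequence, profile))


def rank_of_native(native, decoys, profile, scores):
    """1-based rank of the native sequence; rank 1 means no decoy scored higher."""
    native_score = global_score(native, profile, scores)
    better = sum(1 for d in decoys if global_score(d, profile, scores) > native_score)
    return better + 1


# Toy scoring values (not taken from Table 3A) and a three-position stability profile.
toy_scores = {("F", "H"): 1.0, ("F", "L"): -1.0, ("G", "H"): -1.0, ("G", "L"): 1.0}
profile = "HHL"
print(global_score("FFG", profile, toy_scores))
print(rank_of_native("FFG", ["GGF", "FGF"], profile, toy_scores))
```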

[0101] The results of the fold recognition experiments are shown in Figure 5A,
Figure 5B, Figure 5C and Figure 5D, and at least three conclusions are drawn from these data.
First, scoring matrices composed of either COREX stability or DSSP secondary structure 
data performed better than randomized data sets in matching a structural target to its amino 
acid sequence. In Figure 5A, Figure 5B, Figure 5C and Figure 5D, the results for COREX 
data are stacked toward the left (successful) side of the rankings, while the randomized data 
approaches a bell-shaped distribution with a maximum near the median of the size of the 
sequence datasets (approximately 10 for the mean size of 19 sequences). Second, for both 
COREX and DSSP scoring matrices, the global algorithm (which took the entire amino acid 
sequence into account) performed significantly better than the local algorithm (which 
generally aligned only a subset of the sequence). Third, the total number of targets falling in 
the most successful bin was similar for both the COREX stability and secondary structure 
matrices, suggesting that COREX stability propensities alone contained a comparable amount 
of information to secondary structure propensities. 

[0102] Because the local alignment algorithms used here compute a score 
without returning the complete alignment of profile to sequence, high scores may have been 
possible from non-structurally significant local alignments. In other words, it is possible that 
a correct sequence may have scored well against its corresponding target structure without 
having placed the individual amino acids in their correct positions within the structure. The 
use of the global alignment in conjunction with amino acid sequences of identical length 
partially alleviated this problem, as no misalignment was allowed in the global scheme. 

Example 9 

Successful Alignment Based on COREX Stability 

[0103] To assess the extent of local alignments that were structurally
significant, minor modifications were made to the PROFILESEARCH source code that saved
the traceback of the alignment matrix. It was found that for targets scoring poorly in the
fold-recognition rankings, local alignments of the corresponding sequence were often not
significant. However, sequences that scored in the top two bins were often found to be
completely and correctly aligned with their target profiles, even though not all of their
residues contributed to the overall score due to the rules of the local algorithm. Three
examples of successful alignment based on COREX stability data alone are shown in Figures
6A, 6B, 6C and Tables 4A, 4B, 4C for the targets Protein G (1igd), DNA topoisomerase I
(1vcc), and tendamistat (2ait), respectively. The alignments calculated using the local
algorithm were correct, despite the fact that no sequence information about the target was
used, and that only a subset of the amino acid sequence was used in the scoring. In addition,
it is noteworthy that the success of these examples is not due to merely a small fragment of
the sequence, as the cumulative 3D-1D matrix score steadily increases over the entire length
of the sequence.

Table 4A. Local Alignment Score of 1igd Sequence to 1igd Stability Profile

* One of skill in the art recognizes that the Residue types are listed by the one letter amino
acid designation.

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote b
of Table 3.

b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
1igd amino acid sequence given in the "Residue Type" column to the 1igd stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

c Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.

d Data in the "Cumulative Local Alignment Score" column was used to generate Figure 5A.

Table 4B. Local Alignment Score of 1vcc Sequence to 1vcc Stability Profile

* One of skill in the art recognizes that the Residue types are listed by the one letter amino
acid designation.

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote b
of Table 3.

b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
1vcc amino acid sequence given in the "Residue Type" column to the 1vcc stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

c Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.

d Data in the "Cumulative Local Alignment Score" column was used to generate Figure 5B.

Table 4C. Local Alignment Score of 2ait Sequence to 2ait Stability Profile

* One of skill in the art recognizes that the Residue types are listed by the one letter amino
acid designation.

a H, M, and L denote high, medium, and low stability as defined in the text and in footnote b
of Table 3.

b Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
2ait amino acid sequence given in the "Residue Type" column to the 2ait stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

c Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.

d Data in the "Cumulative Local Alignment Score" column was used to generate Figure 5C.

Example 10
State of Ensemble Using COREX 

[0104] A database of 81 proteins, 5849 residues total (Table 5), was selected 
from the Protein Data Bank (Baldwin and Rose, 1999) on the basis of biological and 
computational criteria as described previously in Example 1. 

[0105] Next, the COREX algorithm (Hilser & Freire, 1996) was run with a
window size of five residues on each protein in the database. The minimum window size was
set to four, and the simulated temperature was 25 °C. The COREX algorithm generated an
ensemble of partially unfolded microstates using the high-resolution structure of each protein
as a template (Hilser & Freire, 1996) similar to Example 2. This was facilitated by
combinatorially unfolding a predefined set of folding units (i.e., residues 1-5 are in the first
folding unit, residues 6-10 are in the second folding unit, etc.). By means of an incremental
shift in the boundaries of the folding units, an exhaustive enumeration of the partially
unfolded species was achieved for a given folding unit size (Hilser & Freire, 1996; Wrabl et
al., 2001).
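
The enumeration described in paragraph [0105] can be illustrated with a minimal Python sketch. It is not the COREX implementation. The function names `folding_units` and `enumerate_states` are hypothetical, and the rule used here for the minimum window size (a terminal fragment shorter than the minimum is merged into its neighbor) is an assumption, since the text does not state how short fragments are handled.

```python
from itertools import product


def folding_units(n_res, window, offset, min_window):
    """Split residues 0..n_res-1 into contiguous folding units for one boundary offset.

    ASSUMPTION: a terminal fragment shorter than min_window is merged into the
    adjacent unit. The source does not specify this rule.
    """
    if offset > 0:
        cuts = [0] + list(range(offset, n_res, window))
    else:
        cuts = list(range(0, n_res, window))
    cuts.append(n_res)
    units = list(zip(cuts, cuts[1:]))  # half-open (start, stop) intervals
    if len(units) > 1 and units[0][1] - units[0][0] < min_window:
        units[1] = (units[0][0], units[1][1])
        units.pop(0)
    if len(units) > 1 and units[-1][1] - units[-1][0] < min_window:
        units[-2] = (units[-2][0], units[-1][1])
        units.pop()
    return units


def enumerate_states(n_res, window=5, min_window=4):
    """Return the set of microstates as tuples of per-residue folded flags.

    Each boundary offset defines one partition; every unit in a partition is
    independently folded or unfolded. The set includes the fully folded and
    fully unfolded states and removes duplicates across partitions.
    """
    states = set()
    for offset in range(window):
        units = folding_units(n_res, window, offset, min_window)
        for pattern in product((True, False), repeat=len(units)):
            folded = [True] * n_res
            for (start, stop), unit_folded in zip(units, pattern):
                if not unit_folded:
                    folded[start:stop] = [False] * (stop - start)
            states.add(tuple(folded))
    return states
```

For a real protein the number of states grows exponentially with the number of folding units, so this sketch is only practical for short toy chains.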


[0106] Next, the Gibbs free energy for each state, $\Delta G_i$, relative to the fully folded
reference state was calculated from surface area- and conformational entropy-based
parameterizations described previously in Example 2 (Wrabl et al., 2001). Thus, the $\Delta G_i$ of
each state arises from differences in solvation of apolar and polar surface area, and from
differences in conformational entropy between each state and the reference state. Therefore,
dividing the free energy into its component terms gives:

$$\Delta G_i = \Delta G_{apolar,i} + \Delta G_{polar,i} + \Delta G_{confS,i} \tag{9}$$

[0107] As Equation 9 indicates, different values for the component
contributions can provide similar magnitudes for $\Delta G_i$, suggesting that different states can
have similar stabilities, but different mechanisms for achieving that stability.

Example 11
Surface Area Calculations

[0108] The calorimetric enthalpy and entropy of solvation were parameterized 
from polar and apolar surface exposure (Hilser & Freire, 1996). COREX uses empirical 
parameterizations to calculate the relative apolar and polar free energies of each microstate: 

$$\Delta G_{apolar,i}(T) = -8.44\,\Delta ASA_{apolar,i} + 0.45\,\Delta ASA_{apolar,i}\,(T - 333) - T\left(0.45\,\Delta ASA_{apolar,i}\,\ln(T/385)\right)$$

$$\Delta G_{polar,i}(T) = 31.4\,\Delta ASA_{polar,i} - 0.26\,\Delta ASA_{polar,i}\,(T - 333) - T\left(-0.26\,\Delta ASA_{polar,i}\,\ln(T/335)\right)$$
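
As an illustration, the two parameterizations can be evaluated with a short Python sketch. The function name `solvation_free_energies` is hypothetical. The coefficients are read from the equations above, where the polar enthalpy coefficient is taken as 31.4 and the apolar entropy reference temperature as 385 K because the printed values are garbled. The units (surface area in square angstroms, temperature in kelvin, energy in cal/mol) are assumptions not stated in this section.

```python
import math


def solvation_free_energies(dasa_apolar, dasa_polar, temp_k=298.15):
    """Return (dG_apolar, dG_polar) for one microstate.

    dasa_apolar, dasa_polar: change in apolar / polar accessible surface area
    of the microstate relative to the folded reference (units assumed).
    temp_k: temperature in kelvin; 298.15 K corresponds to the 25 degrees C
    simulated temperature mentioned in the text.
    """
    dg_apolar = (
        -8.44 * dasa_apolar
        + 0.45 * dasa_apolar * (temp_k - 333.0)
        - temp_k * (0.45 * dasa_apolar * math.log(temp_k / 385.0))
    )
    dg_polar = (
        31.4 * dasa_polar
        - 0.26 * dasa_polar * (temp_k - 333.0)
        - temp_k * (-0.26 * dasa_polar * math.log(temp_k / 335.0))
    )
    return dg_apolar, dg_polar
```

Both expressions are linear in the exposed area, so doubling the change in surface area doubles the corresponding free energy at a fixed temperature.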

[0109] The three primary components used to calculate conformational
entropies ($\Delta S_{i,conf}$) for each microstate were: (1) $\Delta S_{bu \to ex}$, the entropy change associated with
the transfer of a side-chain that was buried in the interior of the protein to its surface; (2)
$\Delta S_{ex \to u}$, the entropy change gained by a surface-exposed side-chain when the peptide
backbone unfolds; and (3) $\Delta S_{bb}$, the entropy change gained by the backbone itself upon
unfolding (Hilser & Freire, 1996). For fold recognition calculations, the total ($\Delta S_{i,conf}$) of all
proteins is multiplied by a scaling factor to eliminate the unfolded state contribution to the
residue-specific thermodynamic parameters.

[0110] Next, the residue stability constant, $\kappa_{f,j}$, was calculated similar to
Example 2. The residue stability constant is the ratio of the summed probability of all states
in the ensemble in which a particular residue, j, is in a folded conformation ($\sum P_{f,j}$) to the
summed probability of all states in which residue j is in an unfolded (i.e., non-folded)
conformation ($\sum P_{nf,j}$).

[0111] Equation 2, in turn, was used to define a residue-specific free energy of
folding for the protein ($\Delta G_{f,j} = -RT \ln \kappa_{f,j}$), which was expanded to give
($\Delta G_{f,j} = RT \ln Q_{nf,j} - RT \ln Q_{f,j}$), where $Q_{nf,j}$ and $Q_{f,j}$ were the sub-partition functions for
states in which residue j was unfolded and folded, respectively. Thus, the residue-specific
free energy provides the difference in energy between the sub-ensembles in which each
residue is folded and unfolded. In other words, the residue stability constant does not provide
the contribution of each amino acid to the stability of a protein. Rather, it provides the
relative stability of that region of the protein, implicitly considering the contribution of all
amino acids in the protein toward the observed stability at that position.
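
The definition of the residue stability constant can be sketched in Python for a toy ensemble. The function name `residue_stability`, the representation of a microstate as a pair of free energy and per-residue folded flags, and the use of cal/mol units (R = 1.987 cal/(mol K)) are illustrative assumptions. The sketch assumes every residue is unfolded in at least one state, otherwise the ratio is undefined.

```python
import math

R = 1.987  # gas constant in cal/(mol*K); the unit system is an assumption


def residue_stability(states, temp_k=298.15):
    """Compute per-residue stability constants and folding free energies.

    states: list of (dG_i, folded_i), where dG_i is the free energy of
    microstate i relative to the folded reference and folded_i is a sequence
    of booleans, one per residue.
    Returns (kappa, dg_fold), two lists indexed by residue.
    """
    n_res = len(states[0][1])
    q_f = [0.0] * n_res   # sub-partition function, residue j folded
    q_nf = [0.0] * n_res  # sub-partition function, residue j unfolded
    for dg, folded in states:
        weight = math.exp(-dg / (R * temp_k))  # statistical weight of state i
        for j, is_folded in enumerate(folded):
            if is_folded:
                q_f[j] += weight
            else:
                q_nf[j] += weight
    # The ratio of summed probabilities equals the ratio of sub-partition
    # functions, because the total partition function cancels.
    kappa = [qf / qnf for qf, qnf in zip(q_f, q_nf)]
    dg_fold = [-R * temp_k * math.log(k) for k in kappa]
    return kappa, dg_fold
```

In a toy ensemble where one residue unfolds only in the high-energy, fully unfolded state while its neighbor also unfolds in a low-energy partial state, the first residue receives the larger stability constant, which mirrors the regional interpretation given in paragraph [0111].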

[0112] As shown in Figure 8, the stability constants provided a residue-specific 
description of the regional differences in stability within a protein structure. The importance 
of this quantity from the point of view of fold recognition is two-fold. First, the stability 
constant is compared directly to protection factors obtained from native state hydrogen 
exchange experiments, thus providing an experimentally verifiable residue-specific 
description of the ensemble. Second, as amino acids are non-randomly distributed across 
high, medium and low stability environments, the stability constant as a function of residue 
position provides a convenient 1-dimensional representation of the 3-dimensional structure. 

Example 12 

Identification of Additional Thermodynamic Determinants

[0113] First, the $\Delta G_i$ for each microstate i in the ensemble was composed of
solvation and conformational entropy terms as described by Equation 9 and Example 10.
Equation 9 was rewritten in terms of the enthalpic and entropic components:

$$\Delta G_i = \Delta H_{i,solvation} - T(\Delta S_{i,solvation} + \Delta S_{i,conformational}) \tag{12}$$

[0114] Each of the solvation terms in Equation 12 was further expanded into
contributions based on apolar and polar surface area:

$$\Delta G_i = (\Delta H_{i,solvation,apolar} + \Delta H_{i,solvation,polar}) - T(\Delta S_{i,solvation,apolar} + \Delta S_{i,solvation,polar}) - T(\Delta S_{i,conformational}) \tag{13}$$

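[0115] However, the identical values for the apolar and polar areas of each state
were used for the respective terms in the enthalpy and entropy calculations. Therefore, the
absolute values for the enthalpy and entropy terms for a given area type were related by
constants $k_1$ (for apolar area) and $k_2$ (for polar area), yielding the expression:

$$\Delta G_i = (\Delta H_{i,solvation,apolar} + \Delta H_{i,solvation,polar}) - T(k_1 \cdot \Delta H_{i,solvation,apolar} + k_2 \cdot \Delta H_{i,solvation,polar}) - T(\Delta S_{i,conformational}) \tag{14}$$

[0116] Grouping area types together and simplifying gives:

$$\Delta G_i = (1 - T k_1) \cdot \Delta H_{i,solvation,apolar} + (1 - T k_2) \cdot \Delta H_{i,solvation,polar} - T(\Delta S_{i,conformational}) \tag{15}$$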

[0117] Equation 15 revealed that for a given free energy and conformational 
entropy, the relative contribution of polar and apolar surface to the solvation free energy was 
ascertained from the ratio of polar to apolar enthalpy for each state. 

[0118] Thus, to arrive at a residue-specific contribution of polar and apolar 
solvation, a given thermodynamic parameter (i.e. enthalpy or entropy) is considered an 
average excess quantity, which represents the population-weighted contribution of all states 
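in the ensemble. For instance, the average excess enthalpy and entropy were defined as:

$$\langle \Delta H \rangle = \sum_{i=1}^{N_{states}} P_i \cdot \Delta H_i = \sum_{i=1}^{N_{states}} \frac{K_i \cdot \Delta H_i}{Q} \tag{16A}$$

$$\langle \Delta S \rangle = \sum_{i=1}^{N_{states}} P_i \cdot \Delta S_i = \sum_{i=1}^{N_{states}} \frac{K_i \cdot \Delta S_i}{Q} \tag{16B}$$
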

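[0119] Following from Equations 16A and 16B, residue-specific descriptors of
the polar and apolar enthalpy were defined accordingly. The polar component of the
enthalpy was defined as the difference between the average excess polar enthalpy from the
sub-ensemble in which residue j is folded ($\langle \Delta H_{pol,f,j} \rangle$) and the average excess polar
enthalpy from the sub-ensemble in which residue j is unfolded ($\langle \Delta H_{pol,nf,j} \rangle$):

$$\Delta H_{pol,j} = \langle \Delta H_{pol,f,j} \rangle - \langle \Delta H_{pol,nf,j} \rangle \tag{17}$$

where:

$$\langle \Delta H_{pol,f,j} \rangle = \sum_{i=1}^{N_{j,folded}} \frac{\Delta H_{pol,i} \cdot e^{-\Delta G_i / RT}}{Q_{f,j}} \tag{18}$$

$$\langle \Delta H_{pol,nf,j} \rangle = \sum_{i=1}^{N_{j,not\ folded}} \frac{\Delta H_{pol,i} \cdot e^{-\Delta G_i / RT}}{Q_{nf,j}} \tag{19}$$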

[0120] It is important to note that the summations in Equations 18 and 19 were
only over the sub-ensembles in which residue j was folded and unfolded, respectively, and
the parameters $Q_{f,j}$ and $Q_{nf,j}$ were the sub-partition functions for those sub-ensembles. By
identical reasoning, the residue-specific apolar component to the enthalpy of residue j and the
residue-specific conformational entropy component of residue j were defined as:

$$\Delta H_{apol,j} = \langle \Delta H_{apol,f,j} \rangle - \langle \Delta H_{apol,nf,j} \rangle \tag{20}$$

$$\Delta S_{conf,j} = \langle \Delta S_{conf,f,j} \rangle - \langle \Delta S_{conf,nf,j} \rangle \tag{21}$$

[0121] As in the case with the residue stability constant, the expressions for the
residue-specific $\Delta H_{apol,j}$, $\Delta H_{pol,j}$ and $\Delta S_{conf,j}$ do not provide the contributions of residue j to the
respective overall thermodynamic properties. Instead, Equations 17, 20 and 21 reflect the
average thermodynamic environments of that residue, accounting implicitly for the
contribution of all the amino acids over all the states in the ensemble.
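
The residue-specific descriptors of Equations 17, 20 and 21 share one form: a Boltzmann-weighted average of a state property over the sub-ensemble in which residue j is folded, minus the same average over the sub-ensemble in which it is unfolded. A minimal Python sketch of that form follows. The function name `residue_descriptor`, the tuple layout of a microstate, and the cal/mol unit system are assumptions made for illustration.

```python
import math

R = 1.987  # gas constant in cal/(mol*K); the unit system is an assumption


def residue_descriptor(states, j, temp_k=298.15):
    """Difference of sub-ensemble averages for residue j.

    states: list of (dG_i, value_i, folded_i), where value_i is the state
    property being averaged (for example polar enthalpy, apolar enthalpy, or
    conformational entropy) and folded_i is a sequence of booleans per residue.
    Both sub-ensembles must be non-empty.
    """
    sum_f = q_f = 0.0    # weighted sum and sub-partition function, j folded
    sum_nf = q_nf = 0.0  # weighted sum and sub-partition function, j unfolded
    for dg, value, folded in states:
        weight = math.exp(-dg / (R * temp_k))
        if folded[j]:
            sum_f += value * weight
            q_f += weight
        else:
            sum_nf += value * weight
            q_nf += weight
    return sum_f / q_f - sum_nf / q_nf
```

Because each average runs over whole microstates, the result depends on every residue that changes state together with residue j, which is why the text describes these quantities as environments rather than per-residue contributions.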

Example 13 

Residue-Specific Thermodynamic Environments 

[0122] Using Equations 2, 17, 20, and 21, thermodynamic environments were
empirically defined so as to systematically account for the different contributions of solvation
and conformational entropy to the overall stability constant of each residue. As shown in
Figure 9A-Figure 9C, three thermodynamic dimensions were considered: stability ($\kappa_{f,j}$),
enthalpy ($H_{ratio,j}$), and entropy ($S_{ratio,j}$). The first dimension utilizes the stability constant
classification (Figure 8A and Figure 8B) defined by Equation 2. As the particular value for
the stability constant can arise from conformational entropy or solvent related phenomena, a
second dimension was utilized that provided the ratio of the conformational entropy to the
total solvation free energy:

$$S_{ratio,j} = \frac{\Delta S_{conf,j}}{\Delta G_{solv,j}} \tag{22}$$

[0123] where $\Delta G_{solv,j}$ is the total residue-specific solvation component
calculated similar to Equations 17-21. Finally, as the total solvation component can arise
from polar or apolar contributions, a third dimension was incorporated that provided the ratio
of polar to apolar enthalpy described by Equations 17 and 20:

$$H_{ratio,j} = \frac{\Delta H_{pol,j}}{\Delta H_{apol,j}} \tag{23}$$


[0124] Thus, the residues making up the 81 proteins (Table 5) that were
analyzed partitioned non-randomly within the three-dimensional thermodynamic space. The 
non-random distribution of residues resulted in an empirical partitioning of the residue- 
specific data into twelve thermodynamic categories by dividing the stability data into three 
categories, the enthalpy data into two categories, and the entropy data into two categories 
(Figure 9A-Figure 9C). 

Example 14 
Binning of Thermodynamic Environments 

[0125] Each of the 5849 residues in the database was binned into one of the
twelve thermodynamic environment classes based on its stability ($\kappa_{f,j}$), enthalpy ($H_{ratio,j}$),
and entropy ($S_{ratio,j}$) values. These thermodynamic environments were denoted by the
following abbreviations: LLL, LLH, LHL, LHH, MLL, MLH, MHL, MHH, HLL, HLH,
HHL, HHH. For example, residues in the LHH thermodynamic environment were binned
into the Low (L) stability ($\kappa_{f,j}$) class, the High (H) enthalpy ($H_{ratio,j}$) class, and the High
(H) entropy ($S_{ratio,j}$) class. The cutoffs for each thermodynamic class were defined as:

Stability ($\kappa_{f,j}$) class (L, M, or H):

Low $\kappa_{f,j}$ (L) = [ $\ln \kappa_{f,j} < 7.95$ ] (22)

Medium $\kappa_{f,j}$ (M) = [ $7.95 \le \ln \kappa_{f,j} < 13.4$ ] (23)

High $\kappa_{f,j}$ (H) = [ $13.4 \le \ln \kappa_{f,j}$ ] (24)

Enthalpy ($H_{ratio,j}$) class (L or H):

Low $H_{ratio,j}$ (L) = [ $-\Delta H_{pol,j} < -1.024 \cdot \Delta H_{apol,j} - 2553$ ] (25)

High $H_{ratio,j}$ (H) = [ $-\Delta H_{pol,j} \ge -1.024 \cdot \Delta H_{apol,j} - 2553$ ] (26)

Entropy ($S_{ratio,j}$) class (L or H):

Low $S_{ratio,j}$ (L) = [ $-T\Delta S_{conf,j} < 0.125 \cdot \Delta G_{solv,j} - 3053$ ] (27)

High $S_{ratio,j}$ (H) = [ $-T\Delta S_{conf,j} \ge 0.125 \cdot \Delta G_{solv,j} - 3053$ ] (28)
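
The binning rules can be written as a small Python function, shown here as a sketch under stated assumptions. The function name `classify_environment` and its argument names are hypothetical. The inequalities and numerical cutoffs are transcribed from the class definitions above, with the leading minus signs on the polar enthalpy and on the entropy term read as printed. The energy units of the inputs are not stated in this section and are left to the caller.

```python
def classify_environment(ln_kappa, dh_pol, dh_apol, t_ds_conf, dg_solv):
    """Return the three-letter thermodynamic environment of one residue.

    ln_kappa:  natural log of the residue stability constant
    dh_pol:    residue-specific polar enthalpy
    dh_apol:   residue-specific apolar enthalpy
    t_ds_conf: temperature times the residue-specific conformational entropy
    dg_solv:   residue-specific total solvation free energy
    The letters are ordered stability, enthalpy, entropy.
    """
    if ln_kappa < 7.95:
        stability = "L"
    elif ln_kappa < 13.4:
        stability = "M"
    else:
        stability = "H"

    enthalpy = "L" if -dh_pol < -1.024 * dh_apol - 2553 else "H"
    entropy = "L" if -t_ds_conf < 0.125 * dg_solv - 3053 else "H"
    return stability + enthalpy + entropy
```

Three stability classes combined with two enthalpy classes and two entropy classes give the twelve environments listed in paragraph [0125].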

[0126] Visual inspection of the segregation of amino acid types as a function of
various thermodynamic parameters extracted from the 81-protein COREX database, guided
by the development outlined above, suggested that the general classifications of stability,
enthalpy, and entropy reasonably divided thermodynamic space (as indicated in Figure
9). The exact cutoffs for the twelve residue-specific thermodynamic environments used in
the threading calculations were determined automatically by an exhaustive grid search of all
possible cutoffs. The utility of each trial set of cutoffs was initially determined from a coarse search
of cutoff space by threading a constant subset of 8 targets in the protein database and
recording sets of cutoffs that maximized the Z-scores and percentiles for each target. Then, a
finer grid search over the best sets of cutoffs, threading against a subset of 20 targets for each
trial set of cutoffs, resulted in the optimized set of cutoffs used for the threading experiments
shown in this work. Identical cutoffs were used for the alpha/beta threading calculations, i.e.,
no special optimization was performed for the scoring of the alpha/beta experiment.

[0127] Statistics for amino acid type as a function of each of the 
thermodynamic environments were tabulated (Table 6) and the log-odds probability for an 
amino acid type to be in each thermodynamic environment was calculated. The resulting 
histograms (Figure 10) revealed a non-random distribution of the amino acids within the 
thermodynamic environments. For example, hydrophobic residues such as Ile, Phe, and Val
were observed with lower frequency in the MLL environment, while polar and charged
amino acids such as Asp, Gln, and Lys were observed with higher frequency in this
environment. These distributions cannot always be rationalized on the basis of side chain 
chemical properties, however, as the basic amino acids Arg and Lys exhibited very different 
propensities to occur in the MHL environment. This latter observation must be a reflection of 
the fact that ensemble-derived energetics included averaged tertiary enthalpic and entropic 
information that is not encoded by individual side chain properties alone. 

Example 15
Fold-Recognition Details 

[0128] Simple fold-recognition experiments were performed based on amino 
acid distributions within the twelve thermodynamic environments. 

[0129] Briefly, a profiling method was used to create thermodynamic 
environment profiles for each of the 81 proteins in the database (Bowie et al, 1991; Gribskov
et al, 1987). The 81 amino acid sequences (Table 5) coding for the native structures used in 
the database (in addition to 3777 decoy sequences) were each threaded against the 81 target 
thermodynamic environment profiles. The decoy sequences were obtained from the Protein 
Data Bank and were inclusive for all sequences coding for "foldable" proteins ranging from 
35 to 100 residues. 

[0130] Next, a 3D-1D scoring matrix for each protein in the database was
calculated, in which the scoring matrix data was simply the log-odds probabilities of finding 
amino acid types in one of the thermodynamic environment classes (Equation 30, below). 
The resulting profile of the target protein was then optimally aligned to each member of a 
library of amino acid sequences (i.e. 3858 decoy sequences) by maximizing the score 
between the sequence and the profile using a local alignment algorithm based on the Smith- 
Waterman algorithm (Smith & Waterman, 1981) as implemented in PROFILESEARCH 
(Bowie et al, 1991). No attempt was made to optimize the gap opening and extension 
penalties for the local algorithm; in all cases these were the default values given in the 
PROFILESEARCH package, 5.00 and 0.05, respectively. Z-scores were computed from 
PROFILESEARCH for each threading result from Equation (30): 

$$Z = \frac{s - \langle S \rangle}{\sigma} \tag{30}$$

[0131] In Equation 30, s was the PROFILESEARCH threading score of a
sequence i when threaded against the structure corresponding to sequence i, $\langle S \rangle$ was the
average threading score of all sequences in the database (identical in length to sequence i)
threaded against the structure corresponding to sequence i, and $\sigma$ was the standard deviation
of the scores of all sequences in the database (identical in length to sequence i) threaded
against the structure corresponding to sequence i. Thus, the Z-score was the number of
standard deviations above the mean that sequence i scored against its target.
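
A Z-score of this kind can be computed with the Python standard library, as in the sketch below. The function name `z_score` is hypothetical. Whether the background distribution includes the target sequence itself, and whether a population or sample standard deviation was used, are not stated in the text; the population form (`statistics.pstdev`) is an assumption here.

```python
from statistics import mean, pstdev


def z_score(target_score, background_scores):
    """Number of standard deviations by which target_score exceeds the mean.

    background_scores: threading scores of the equal-length sequences threaded
    against the same target profile. Uses the population standard deviation,
    which is an assumption.
    """
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    return (target_score - mu) / sigma
```

For example, a target score of 10 against background scores with mean 5 and standard deviation 2 gives a Z-score of 2.5.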

[0132] Nearly three-fourths (60/81) of the correct sequences scored in the top 
5th percentile when threaded against their corresponding thermodynamic environment profile
(Figure 10), and the Z-scores (the number of standard deviations a particular sequence scored 
above the mean score of all chains of identical length) for these successful threadings ranged 
from 1.76 to 12.23 (Table 7). 

Example 16
Construction of Scoring Matrices 

[0133] The scoring matrices were calculated as log-odds probabilities of finding
residue type j in structural environment k, as described below (Wrabl et al., 2001; Bowie et
al., 1991). The matrix score, $S_{j,k}$, was defined as:

$$S_{j,k} = \ln \frac{P_{j|k}}{P_k} \tag{27}$$

[0134] $P_{j|k}$ is the probability of finding a residue of type j in stability class k (i.e.,
number of counts of residue type j in stability class k divided by the total number of counts of
residue type j), and $P_k$ is the probability of finding any residue in the database in stability
environment k (i.e., number of residues in stability class k, regardless of amino acid type,
divided by the total number of residues in the entire database, regardless of amino acid type).
The structural environment used was one of the twelve COREX thermodynamic 
environments (LHH, LHL, LLH, LLL, MHH, MHL, MLH, MLL, HHH, HHL, HLH, HLL), 
as described above. The fold recognition target was removed from the database, and the 
remaining 80 proteins were used to calculate the probabilities. Therefore, information about 
the target was never included in the scoring matrix. 
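
The construction of a leave-one-out scoring matrix can be sketched in Python as follows. The function name `log_odds_matrix` and the input layout (one tuple of protein identifier, residue type, and environment per residue) are hypothetical. Residue and environment pairs that are never observed are simply omitted from the result, because the logarithm of zero is undefined and the text does not describe a pseudocount scheme.

```python
import math
from collections import Counter


def log_odds_matrix(observations, exclude_protein=None):
    """Return {(residue_type, environment): log-odds score}.

    observations: iterable of (protein_id, residue_type, environment).
    exclude_protein: identifier of the fold-recognition target, whose residues
    are left out so that the matrix carries no information about the target.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    env_counts = Counter()
    total = 0
    for protein_id, residue_type, environment in observations:
        if protein_id == exclude_protein:
            continue
        pair_counts[(residue_type, environment)] += 1
        residue_counts[residue_type] += 1
        env_counts[environment] += 1
        total += 1

    scores = {}
    for (residue_type, environment), count in pair_counts.items():
        # Counts of residue type j in environment k over all counts of type j.
        p_jk = count / residue_counts[residue_type]
        # Fraction of all residues found in environment k.
        p_k = env_counts[environment] / total
        scores[(residue_type, environment)] = math.log(p_jk / p_k)
    return scores
```

A positive score means the residue type is found in that environment more often than an average residue would be, and a negative score means it is found there less often.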

Example 17

Thermodynamic Information is more Fundamental 
than Secondary Structure Information 

[0135] Secondary structure, although useful in the analysis and classification of 
protein folds, is an easily reportable observable that does little to explain the underlying 
physical chemistry of protein structure. In fact, secondary structure can be viewed as a 
manifestation of the backbone/side-chain van der Waals' repulsions that divide phi/psi space, 
modified by the thermodynamic stability afforded by local and tertiary interactions such as 
hydrogen bonding and the hydrophobic effect (Srinivasan & Rose, 1999; Baldwin & Rose, 
1999). Any reasonable description of the energetics of protein structure must be able to 
reflect these realities independent of secondary structural propensities of amino acids and the 
secondary structural classifications of folds. 

[0136] Although the COREX energy function accounts for specific
interactions only in an implicit way, the results of a COREX calculation may provide deeper 
insight than secondary structure into the structural determinants of protein folds. For 
example, Figure 9C compared the thermodynamic environment profiles for an all-alpha 
protein and an all-beta protein threaded over their native folds. Visual inspection of the two 
color-coded structures revealed that different thermodynamic environments span single types 
of secondary structure, and that the same thermodynamic environment was found in different 
types of secondary structural elements. 

[0137] Thus, a threading procedure was repeated on a subset of proteins from 
the original database (Table 5), sorted by secondary structure to determine the possibility that 
the thermodynamic environments calculated by COREX represented a fundamental property 
of proteins that transcended structural classifications. 

[0138] First, a scoring table was assembled from the 31 proteins in Table 5 that 
were classified by the SCOP database as being "All alpha" proteins. Second, the 12 "All 
beta" proteins from Table 5 were threaded using the scoring table derived solely from the 
"All alpha" proteins. In other words, amino acid propensities for the thermodynamic 
environments from all-alpha proteins were used to perform fold recognition experiments on 
all-beta proteins. For more than 80% of the targets (10/12), sequences known to adopt the 
native all-beta structures scored in the top 5% of the 3858 decoy sequences (Figure 12).

[0139] This result was a clear demonstration that the energetic information
derived from the COREX calculations was independent of protein secondary structure. 

[0141] Although the present invention and its advantages have been described
in detail, it should be understood that various changes, substitutions and alterations can be 
made herein without departing from the spirit and scope of the invention as defined by the 
appended claims. Moreover, the scope of the present application is not intended to be limited 
to the particular embodiments of the process, machine, manufacture, composition of matter, 
means, methods and steps described in the specification. As one of ordinary skill in the art 
will readily appreciate from the disclosure of the present invention, processes, machines, 
manufacture, compositions of matter, means, methods, or steps, presently existing or later to 
be developed that perform substantially the same function or achieve substantially the same 
result as the corresponding embodiments described herein may be utilized according to the 
present invention. Accordingly, the appended claims are intended to include within their 
scope such processes, machines, manufacture, compositions of matter, means, methods, or 
steps. 